Abstract

This paper presents a kernel-based discriminative learning framework on probability measures.
Rather than relying on large collections of vectorial training examples, our framework
learns using a collection of probability distributions that have been constructed to meaningfully
represent training data. By representing these probability distributions as mean embeddings in
the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based
learning techniques in straightforward fashion. To accomplish this, we construct a generalization
of the support vector machine (SVM) called a support measure machine (SMM). Our
analyses of SMMs provide several insights into their relationship to traditional SVMs. Based
on such insights, we propose a flexible SVM (Flex-SVM) that places different kernel functions
on each training example. Experimental results on both synthetic and real-world data
demonstrate the effectiveness of our proposed framework.

1 Introduction 

Discriminative learning algorithms are typically trained from large collections of vectorial training 
examples. In many classical learning problems, however, it is arguably more appropriate to represent 
training data not as individual data points, but as probability distributions. There are, in fact, 
multiple reasons why probability measures may be preferable. 

Firstly, uncertain or missing data naturally arise in many applications. For example, gene
expression data obtained from microarray experiments are known to be very noisy due to
various sources of variability (Yang & Speed, 2002). In order to reduce uncertainty, and to allow
for estimates of confidence levels, experiments are often replicated. Unfortunately, the feasibility of
replicating the microarray experiments is often inhibited by cost constraints, as well as the amount
of available mRNA. To cope with experimental uncertainty given a limited amount of data, it is
natural to represent each array as a probability distribution that has been designed to approximate
the variability of gene expressions across slides.

Table 1: The analytic forms of expected kernels for different choices of kernels and distributions
whose mean and covariance are $m$ and $\Sigma$.

Probability distributions may be equally appropriate given an abundance of training data. In
data-rich disciplines such as neuroinformatics, climate informatics, and astronomy, a high-throughput
experiment can easily generate a huge amount of data, leading to significant computational
challenges in both time and space. Instead of scaling up one's learning algorithms, one can scale 
down one's dataset by constructing a smaller collection of distributions which represents groups of 
similar samples. Besides computational efficiency, aggregate statistics can potentially incorporate 
higher-level information that represents the collective behavior of multiple data points. 

Previous attempts have been made to learn from distributions by creating positive definite (p.d.)
kernels on probability measures. Jebara et al. (2004) proposed the probability product kernel (PPK)
as a generalized inner product between two input objects. This kernel can be shown to be closely
related to well-known kernels such as the Bhattacharyya kernel (Bhattacharyya, 1943) and the
exponential symmetrized Kullback-Leibler (KL) divergence (Moreno et al., 2004). In Hein & Bousquet
(2005), an extension of a two-parameter family of Hilbertian metrics of Topsøe was used to define
Hilbertian kernels on probability measures. Cuturi et al. (2005) proposed the semi-group kernels
designed for objects with additive semi-group structure such as positive measures. Recently,
Martins et al. (2009) introduced nonextensive information theoretic kernels on probability measures
based on new Jensen-Shannon-type divergences. Although these kernels have proven successful in
many applications, they are designed specifically for certain properties of distributions and
application domains. Moreover, no attempt has been made to connect them to the kernels on the
corresponding input spaces.

The contributions of this paper are summarized as follows. First, we prove the representer 
theorem for a regularization framework over the space of probability distributions, which is a 
generalization of regularization over the input space on which the distributions are defined. Second, 
a family of positive definite kernels on distributions is introduced. Based on such kernels, a learning
algorithm on probability measures called the support measure machine (SMM) is proposed. An SVM
on the input space is provably a special case of the SMM. Third, the paper presents the relations
between sample-based and distribution-based methods. If the distributions depend only on the
locations in the input space, the SMM reduces to a more flexible SVM that places
a different kernel on each data point.

The remainder of this paper is organized as follows. Section 2 introduces a regularization
framework on probability measures. A family of kernels on probability distributions is presented
in Section 3. Theoretical analyses are subsequently presented in Section 4, followed by a summary
of related works in Section 5. In Section 6, experimental results on both synthetic and real-world
datasets are presented with discussions. Finally, we conclude the paper with Section 7.

2 Distributive Regularization

Given a non-empty set $\mathcal{X}$, let $\mathscr{P}$ denote the set of all probability measures $\mathbb{P}$ on a measurable
space $(\mathcal{X}, \mathcal{A})$, where $\mathcal{A}$ is a $\sigma$-algebra of subsets of $\mathcal{X}$. The goal of this work is to learn a function
$h : \mathscr{P} \to \mathcal{Y}$ given a set of example pairs $\{(\mathbb{P}_i, y_i)\}_{i=1}^{m}$, where $\mathbb{P}_i \in \mathscr{P}$ and $y_i \in \mathcal{Y}$. In other words,
we consider a supervised setting in which input training examples are probability distributions. In
this paper, we focus on the binary classification problem, i.e., $\mathcal{Y} = \{+1, -1\}$.

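In order to learn from distributions, we employ a compact representation that not only preserves
necessary information of individual distributions, but also permits efficient computations. Unlike
previous works (Jebara et al., 2004; Cuturi et al., 2005; Martins et al., 2009), we adopt a Hilbert
space embedding to represent the distribution as a point in an RKHS (Berlinet & Agnan, 2004;
Smola et al., 2007). Formally, let $\mathcal{H}$ denote an RKHS of functions $f : \mathcal{X} \to \mathbb{R}$ endowed with a
reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The mean map from $\mathscr{P}$ into $\mathcal{H}$ is defined as

$\mu : \mathscr{P} \to \mathcal{H}, \quad \mathbb{P} \mapsto \int_{\mathcal{X}} k(x, \cdot)\, d\mathbb{P}(x).$    (1)

We assume that $k(x, \cdot)$ is bounded for any $x \in \mathcal{X}$. It can be shown that, if $k$ is characteristic, the map
(1) is injective, i.e., all the information about the distribution is preserved (Sriperumbudur et al.,
2010). For any $\mathbb{P}$, letting $\mu_{\mathbb{P}} = \mu(\mathbb{P})$, we have that

$\mathbb{E}_{\mathbb{P}}[f] = \langle \mu_{\mathbb{P}}, f \rangle_{\mathcal{H}}, \quad \forall f \in \mathcal{H}.$    (2)

The mean embedding $\mu_{\mathbb{P}}$ is a feature map associated with the kernel $K : \mathscr{P} \times \mathscr{P} \to \mathbb{R}$, defined as
$K(\mathbb{P}, \mathbb{Q}) = \langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}}$. Since $\sup_x \|k(x, \cdot)\|_{\mathcal{H}} < \infty$, it also follows that
$K(\mathbb{P}, \mathbb{Q}) = \iint \langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal{H}}\, d\mathbb{P}(x)\, d\mathbb{Q}(y) = \iint k(x, y)\, d\mathbb{P}(x)\, d\mathbb{Q}(y)$,
where the second equality follows from the reproducing property of $\mathcal{H}$. It
is immediate that $K$ is a p.d. kernel on $\mathscr{P}$.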
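
The identity (2) can be checked numerically on an empirical distribution. The following is a minimal sketch, assuming a scalar Gaussian RBF embedding kernel and a test function $f$ built from three kernel sections; the helper names `rbf` and `mean_embedding`, the centers, and the coefficients are illustrative choices and not part of the original framework.

```python
import math
import random


def rbf(x, z, gamma=1.0):
    """Gaussian RBF embedding kernel k(x, z) = exp(-gamma/2 * (x - z)^2) on scalars."""
    return math.exp(-0.5 * gamma * (x - z) ** 2)


def mean_embedding(sample, kernel=rbf):
    """Return the empirical mean embedding t -> (1/n) * sum_i k(x_i, t)."""
    n = len(sample)
    return lambda t: sum(kernel(x, t) for x in sample) / n


# f = sum_j beta_j k(z_j, .) is an element of the RKHS (hypothetical coefficients).
centers = [-1.0, 0.5, 2.0]
betas = [0.3, -1.2, 0.7]


def f(x):
    return sum(b * rbf(z, x) for b, z in zip(betas, centers))


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
mu_hat = mean_embedding(sample)

# Left side of (2): expectation of f under the empirical distribution.
lhs = sum(f(x) for x in sample) / len(sample)
# Right side of (2): <mu_hat, f>_H = sum_j beta_j * mu_hat(z_j) by the reproducing property.
rhs = sum(b * mu_hat(z) for b, z in zip(betas, centers))
```

Both quantities are the same double sum evaluated in a different order, so they agree up to floating-point rounding.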

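The following representer theorem shows that optimal solutions of a suitable class of regularization
problems involving probability distributions can be expressed as a finite linear combination of
mean embeddings.

Theorem 1. Given training examples $(\mathbb{P}_i, y_i) \in \mathscr{P} \times \mathbb{R}$, $i = 1, \ldots, m$, a strictly monotonically
increasing function $\Omega : [0, +\infty) \to \mathbb{R}$, and a loss function $\ell : (\mathscr{P} \times \mathbb{R}^2)^m \to \mathbb{R} \cup \{+\infty\}$, any $f \in \mathcal{H}$
minimizing the regularized risk functional

$\ell\left(\mathbb{P}_1, y_1, \mathbb{E}_{\mathbb{P}_1}[f], \ldots, \mathbb{P}_m, y_m, \mathbb{E}_{\mathbb{P}_m}[f]\right) + \Omega\left(\|f\|_{\mathcal{H}}\right)$    (3)

admits a representation of the form $f = \sum_{i=1}^{m} \alpha_i \mu_{\mathbb{P}_i}$ for some $\alpha_i \in \mathbb{R}$, $i = 1, \ldots, m$.

Proof. By virtue of Proposition 2 in Sriperumbudur et al. (2010), the linear functionals $\mathbb{E}_{\mathbb{P}}[\cdot]$ are
bounded for all $\mathbb{P} \in \mathscr{P}$. Then, given $\mathbb{P}_1, \mathbb{P}_2, \ldots, \mathbb{P}_m$, any $f \in \mathcal{H}$ can be decomposed as $f = f_\mu + f^\perp$,
where $f_\mu \in \mathcal{H}$ lives in the span of the $\mu_{\mathbb{P}_i}$, i.e., $f_\mu = \sum_{i=1}^{m} \alpha_i \mu_{\mathbb{P}_i}$, and $f^\perp \in \mathcal{H}$ satisfies
$\langle f^\perp, \mu_{\mathbb{P}_j} \rangle = 0$ for all $j$. Hence, for all $j$, we have
$\mathbb{E}_{\mathbb{P}_j}[f] = \mathbb{E}_{\mathbb{P}_j}[f_\mu + f^\perp] = \langle f_\mu + f^\perp, \mu_{\mathbb{P}_j} \rangle = \langle f_\mu, \mu_{\mathbb{P}_j} \rangle + \langle f^\perp, \mu_{\mathbb{P}_j} \rangle = \langle f_\mu, \mu_{\mathbb{P}_j} \rangle$,
which is independent of $f^\perp$. As a result, the loss functional $\ell$ in (3) does
not depend on $f^\perp$. For the regularization functional $\Omega$, since $f^\perp$ is orthogonal to $\sum_{i=1}^{m} \alpha_i \mu_{\mathbb{P}_i}$ and
$\Omega$ is strictly monotonically increasing, we have
$\Omega(\|f\|) = \Omega(\|f_\mu + f^\perp\|) = \Omega\left(\sqrt{\|f_\mu\|^2 + \|f^\perp\|^2}\right) \geq \Omega(\|f_\mu\|)$,
with equality if and only if $f^\perp = 0$ and thus $f = f_\mu$. Consequently, any minimizer must
take the form $f = \sum_{i=1}^{m} \alpha_i \mu_{\mathbb{P}_i} = \sum_{i=1}^{m} \alpha_i \mathbb{E}_{\mathbb{P}_i}[k(x, \cdot)]$. ∎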

Theorem 1 clearly indicates how each distribution contributes to the minimizer of (3). Roughly
speaking, the coefficients $\alpha_i$ control the contribution of the distributions through the mean
embeddings $\mu_{\mathbb{P}_i}$. Furthermore, if we restrict $\mathscr{P}$ to a class of Dirac measures $\delta_x$ on $\mathcal{X}$ and consider
the training set $\{(\delta_{x_i}, y_i)\}_{i=1}^{m}$, the functional (3) reduces to the usual regularization functional
(Schölkopf et al., 2001) and the solution reduces to $f = \sum_{i=1}^{m} \alpha_i k(x_i, \cdot)$. Therefore, the standard
representer theorem is recovered as a particular case.

Note that, on the one hand, the minimization problem (3) is different from minimizing the functional
$\mathbb{E}_{\mathbb{P}_1} \cdots \mathbb{E}_{\mathbb{P}_m} \ell(x_1, y_1, f(x_1), \ldots, x_m, y_m, f(x_m)) + \Omega(\|f\|_{\mathcal{H}})$ for the special case of the additive
loss $\ell$. Therefore, the solution of the above regularization problem is different from what one would
get in the limit by training on infinitely many points sampled from $\mathbb{P}_1, \ldots, \mathbb{P}_m$. On the other hand,
it is also different from minimizing the functional $\ell(M_1, y_1, f(M_1), \ldots, M_m, y_m, f(M_m)) + \Omega(\|f\|_{\mathcal{H}})$,
where $M_i = \mathbb{E}_{x \sim \mathbb{P}_i}[x]$. In a sense, our framework is something in between.



3 Kernels on Distributions 

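As the map (1) is linear in $\mathscr{P}$, optimizing the functional (3) amounts to finding a function in $\mathcal{H}$ that
approximates well functions from $\mathscr{P}$ to $\mathbb{R}$ in the function class
$\mathcal{F} = \{\mathbb{P} \mapsto \int_{\mathcal{X}} g\, d\mathbb{P} \mid \mathbb{P} \in \mathscr{P}, g \in C(\mathcal{X})\}$,
where $C(\mathcal{X})$ is a class of bounded continuous functions on $\mathcal{X}$. Since $\delta_x \in \mathscr{P}$ for any $x \in \mathcal{X}$,
it also follows that $C(\mathcal{X}) \subset \mathcal{F} \subset C(\mathscr{P})$, where $C(\mathscr{P})$ is a class of bounded continuous functions
on $\mathscr{P}$ endowed with the topology of weak convergence and the associated Borel $\sigma$-algebra. The
following lemma states the relation between the RKHS $\mathcal{H}$ induced by the kernel $k$ and the function
class $\mathcal{F}$.

Lemma 2. The RKHS $\mathcal{H}$ induced by a kernel $k$ is dense in $\mathcal{F}$ if $k$ is universal, i.e., for every
function $F \in \mathcal{F}$ and every $\varepsilon > 0$ there exists a function $g \in \mathcal{H}$ with
$\sup_{\mathbb{P} \in \mathscr{P}} |F(\mathbb{P}) - \int g\, d\mathbb{P}| < \varepsilon$.

Proof. Assume that $k$ is universal. Then, for every function $f \in C(\mathcal{X})$ and every $\varepsilon > 0$ there
exists a function $g \in \mathcal{H}$ induced by $k$ with $\sup_{x \in \mathcal{X}} |f(x) - g(x)| < \varepsilon$ (Steinwart, 2001). Hence,
by linearity of $\mathcal{F}$, for every $F \in \mathcal{F}$ and every $\varepsilon > 0$ there exists a function $h \in \mathcal{H}$ such that
$\sup_{\mathbb{P} \in \mathscr{P}} |F(\mathbb{P}) - \int h\, d\mathbb{P}| < \varepsilon$. ∎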

Nonlinear kernels on $\mathscr{P}$ can be defined in an analogous way to nonlinear kernels on $\mathcal{X}$, by
treating the mean embedding $\mu_{\mathbb{P}}$ of $\mathbb{P} \in \mathscr{P}$ as its feature representation. First, assume that the map
$\mu$ is injective and let $\langle \cdot, \cdot \rangle_{\mathscr{P}}$ be an inner product on $\mathscr{P}$. By linearity, we have
$\langle \mathbb{P}, \mathbb{Q} \rangle_{\mathscr{P}} = \langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}}$
(cf. Berlinet & Agnan (2004) for more details). Then, the nonlinear kernels on $\mathscr{P}$ can be defined as
$K(\mathbb{P}, \mathbb{Q}) = \kappa(\mu_{\mathbb{P}}, \mu_{\mathbb{Q}}) = \langle \psi(\mu_{\mathbb{P}}), \psi(\mu_{\mathbb{Q}}) \rangle_{\mathcal{H}_\kappa}$ with $\kappa$ a p.d. kernel. As a result, many standard nonlinear
kernels on $\mathcal{X}$, e.g., Gaussian RBF kernel, Cauchy kernel, and generalized T-student kernel, can be
used to define nonlinear kernels on $\mathscr{P}$ as long as their evaluation depends entirely on $\langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}}$.
Although requiring more computational effort, their practical use is simple and flexible. Specifically,
the notion of p.d. kernels on distributions proposed in this work is quite generic. Standard kernel
functions can be reused to derive kernels on distributions that are different from many other kernel
functions proposed specifically for certain distributions.

Christmann & Steinwart (2010) recently proved that the Gaussian RBF kernel given by
$K(\mathbb{P}, \mathbb{Q}) = \exp\left(-\frac{\gamma}{2} \|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_{\mathcal{H}}^2\right)$, $\forall \mathbb{P}, \mathbb{Q} \in \mathscr{P}$,
is universal w.r.t. $C(\mathscr{P})$ given that the map $\mu$ is injective. Despite
its success in real-world applications, the theory of kernel-based classifiers beyond the input
space $\mathcal{X} \subset \mathbb{R}^d$, as mentioned by Christmann & Steinwart (2010), is still incomplete. It is therefore of
theoretical interest to consider more general classes of universal kernels on probability distributions.
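
Because the squared RKHS distance expands as $\|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|^2 = K(\mathbb{P},\mathbb{P}) - 2K(\mathbb{P},\mathbb{Q}) + K(\mathbb{Q},\mathbb{Q})$, a Gaussian RBF kernel on distributions can be computed from the linear Gram matrix alone. A minimal sketch follows; the function name `level2_rbf` and the toy Gram matrix are illustrative assumptions.

```python
import math


def level2_rbf(gram, gamma=1.0):
    """Turn a linear Gram matrix G[i][j] = <mu_Pi, mu_Pj>_H into a Gaussian RBF Gram matrix."""
    m = len(gram)
    out = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            # ||mu_Pi - mu_Pj||^2 expands into three inner products.
            sq_dist = gram[i][i] - 2.0 * gram[i][j] + gram[j][j]
            # Guard against tiny negative values caused by rounding.
            out[i][j] = math.exp(-0.5 * gamma * max(sq_dist, 0.0))
    return out


# Toy linear Gram matrix (hypothetical values, symmetric and diagonally dominant).
linear_gram = [
    [1.0, 0.6, 0.1],
    [0.6, 0.9, 0.2],
    [0.1, 0.2, 0.8],
]
K2 = level2_rbf(linear_gram, gamma=2.0)
```

The same pattern applies to any other kernel whose evaluation depends only on inner products of mean embeddings.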

3.1 Support Measure Machines

This subsection extends SVMs to deal with probability distributions, leading to support measure
machines (SMMs). In its most general form, an SMM amounts to solving an SVM problem with
the expected kernel $K(\mathbb{P}_i, \mathbb{P}_j) = \mathbb{E}_{x_i \sim \mathbb{P}_i, x_j \sim \mathbb{P}_j}[k(x_i, x_j)]$. This can be computed in closed form for
certain classes of distributions and kernels $k$. Examples are given in Table 1.
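
As one instance of such a closed form, consider one-dimensional Gaussians with the embedding kernel $k(x, z) = \exp(-\frac{\gamma}{2}(x - z)^2)$. The expression below follows from a standard Gaussian integral and is not copied from Table 1; the function names and the Monte Carlo comparison are illustrative assumptions.

```python
import math
import random


def expected_rbf_gauss(m1, s1, m2, s2, gamma=1.0):
    """Closed-form E[k(x, z)] for x ~ N(m1, s1^2), z ~ N(m2, s2^2),
    with k(x, z) = exp(-gamma/2 * (x - z)^2)."""
    spread = 1.0 + gamma * (s1 ** 2 + s2 ** 2)
    return math.exp(-0.5 * gamma * (m1 - m2) ** 2 / spread) / math.sqrt(spread)


def expected_rbf_mc(m1, s1, m2, s2, gamma=1.0, n=100_000, seed=0):
    """Monte Carlo estimate of the same expectation, used only as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        d = rng.gauss(m1, s1) - rng.gauss(m2, s2)
        total += math.exp(-0.5 * gamma * d * d)
    return total / n


closed = expected_rbf_gauss(0.0, 1.0, 1.5, 0.5, gamma=0.5)
approx = expected_rbf_mc(0.0, 1.0, 1.5, 0.5, gamma=0.5)
```

Setting both standard deviations to zero recovers the plain kernel value $k(m_1, m_2)$, which is the Dirac special case.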

Alternatively, one can approximate the kernel $K(\mathbb{P}, \mathbb{Q})$ by the empirical estimate:

$K_{\mathrm{emp}}(\hat{\mathbb{P}}_n, \hat{\mathbb{Q}}_m) = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, z_j)$    (4)

where $\hat{\mathbb{P}}_n$ and $\hat{\mathbb{Q}}_m$ are empirical distributions of $\mathbb{P}$ and $\mathbb{Q}$ given random samples $\{x_i\}_{i=1}^{n}$ and $\{z_j\}_{j=1}^{m}$,
respectively. A finite sample of size $m$ from a distribution $\mathbb{P}$ suffices to (with high probability)
compute an approximation within an error of $O(m^{-1/2})$. Alternatively, if the sample set is sufficiently large, one
may choose to approximate the true distribution by simpler probabilistic models, e.g., a mixture
of Gaussians model, and choose a kernel $k$ whose expected value admits an analytic form. Storing
only the parameters of probabilistic models may save some space compared to storing all examples.
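
The empirical estimate (4) is a plain average of kernel evaluations over all sample pairs. A minimal sketch, assuming samples are given as lists of coordinate tuples and a Gaussian RBF embedding kernel; the names `rbf` and `k_emp` are illustrative.

```python
import math


def rbf(x, z, gamma=1.0):
    """Gaussian RBF embedding kernel on vectors given as sequences of floats."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-0.5 * gamma * sq)


def k_emp(xs, zs, kernel=rbf):
    """Empirical expected kernel (4): average of k(x_i, z_j) over all n * m pairs."""
    total = sum(kernel(x, z) for x in xs for z in zs)
    return total / (len(xs) * len(zs))


# Two toy samples in the plane (hypothetical data).
P_sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
Q_sample = [(2.0, 2.0), (3.0, 2.0)]
value = k_emp(P_sample, Q_sample)
```

The cost is $n \cdot m$ kernel evaluations per Gram entry, which is the additional computational effort relative to an SVM on single points.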

Table 2 depicts the general forms of linear and nonlinear kernels for sample-based and
distribution-based methods. We will also refer to these definitions to avoid an ambiguity between linear (resp.
nonlinear) SVM and linear (resp. nonlinear) SMM. Moreover, for clarity, the kernel $k$ used to define
the mean embeddings $\mu_{\mathbb{P}}$ will be called the embedding kernel, whereas the kernel $K$ that induces
real-valued functions on $\mathscr{P}$ will be called the level-2 kernel.
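
Putting the pieces of this subsection together, a linear SMM can be sketched as an SVM trained on a precomputed Gram matrix of empirical expected kernels. The solver below is a simplified bias-free dual coordinate ascent chosen to keep the example self-contained; it is an assumption for illustration and not the LIBSVM routine used in the experiments. The toy data and all function names are likewise hypothetical.

```python
import math
import random


def rbf(x, z, gamma=0.5):
    """Scalar Gaussian RBF embedding kernel."""
    return math.exp(-0.5 * gamma * (x - z) ** 2)


def k_emp(xs, zs):
    """Empirical expected kernel between two samples."""
    return sum(rbf(x, z) for x in xs for z in zs) / (len(xs) * len(zs))


def train_smm(gram, labels, C=10.0, epochs=50):
    """Dual coordinate ascent for a bias-free soft-margin SVM on a precomputed Gram matrix."""
    m = len(labels)
    alpha = [0.0] * m
    for _ in range(epochs):
        for i in range(m):
            f_i = sum(alpha[j] * labels[j] * gram[i][j] for j in range(m))
            # Exact maximiser of the dual in coordinate i, clipped to the box [0, C].
            step = (1.0 - labels[i] * f_i) / gram[i][i]
            alpha[i] = min(max(alpha[i] + step, 0.0), C)
    return alpha


def predict(alpha, labels, train_samples, test_sample):
    """Sign of f(P) = sum_i alpha_i y_i K(P_i, P)."""
    score = sum(a * y * k_emp(s, test_sample)
                for a, y, s in zip(alpha, labels, train_samples))
    return 1 if score >= 0 else -1


# Each training example is a sample from a 1-D Gaussian; the classes differ in location.
rng = random.Random(0)
train_samples, labels = [], []
for y in (+1, -1):
    for _ in range(5):
        spread = rng.uniform(0.3, 1.0)
        train_samples.append([rng.gauss(3.0 * y, spread) for _ in range(30)])
        labels.append(y)

gram = [[k_emp(a, b) for b in train_samples] for a in train_samples]
alpha = train_smm(gram, labels)
train_predictions = [predict(alpha, labels, train_samples, s) for s in train_samples]
```

The learned function has the form given by the representer theorem, a weighted sum of mean embeddings of the training distributions.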



4 Theoretical Analyses 
4.1 Risk Deviation Bound 

Given a training sample $\{(\mathbb{P}_i, y_i)\}_{i=1}^{m}$ drawn i.i.d. from some unknown probability distribution
$\mathcal{P}$ on $\mathscr{P} \times \mathcal{Y}$, a loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, and a function class $\Lambda$, the goal of statistical
learning is to find the function $f \in \Lambda$ that minimizes the expected risk functional
$\mathcal{R}(f) = \int_{\mathscr{P}} \int_{\mathcal{X}} \ell(y, f(x))\, d\mathbb{P}(x)\, d\mathcal{P}(\mathbb{P}, y)$. As it cannot be optimized directly due to not knowing $\mathcal{P}$, the
empirical risk $\mathcal{R}_{\mathrm{emp}}(f) = \frac{1}{m} \sum_{i=1}^{m} \int_{\mathcal{X}} \ell(y_i, f(x))\, d\mathbb{P}_i(x)$ based on the training sample is considered
instead. Furthermore, the risk functional can be simplified further by considering
$\frac{1}{m \cdot n} \sum_{i=1}^{m} \sum_{x_{ij} \sim \mathbb{P}_i} \ell(y_i, f(x_{ij}))$ based on $n$ samples $x_{ij}$ drawn according to each $\mathbb{P}_i$.

Our framework, on the other hand, alleviates the problem by minimizing the risk functional
$\mathcal{R}^{\mu}(f) = \int_{\mathscr{P}} \ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])\, d\mathcal{P}(\mathbb{P}, y)$ for $f \in \mathcal{H}$ with corresponding empirical risk functional
$\mathcal{R}^{\mu}_{\mathrm{emp}}(f) = \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, \mathbb{E}_{\mathbb{P}_i}[f(x)])$ (cf. the discussion at the end of Section 2). It is often easier to optimize
$\mathcal{R}^{\mu}_{\mathrm{emp}}(f)$ as the expectation can be computed exactly for certain choices of $\mathbb{P}_i$ and $\mathcal{H}$. Moreover, for
universal $\mathcal{H}$, this simplification preserves all information of the distributions. Nevertheless, there is
still a loss of information due to the loss function $\ell$.

Due to the i.i.d. assumption, the analysis of the difference between $\mathcal{R}$ and $\mathcal{R}^{\mu}$ can be simplified
w.l.o.g. to the analysis of the difference between $\mathbb{E}_{\mathbb{P}}[\ell(y, f(x))]$ and $\ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])$ for a particular
distribution $\mathbb{P} \in \mathscr{P}$. The theorem below provides a bound on the difference between $\mathbb{E}_{\mathbb{P}}[\ell(y, f(x))]$
and $\ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])$.
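
Table 2: The associated kernels for sample-based and distribution-based methods.

                Linear kernel                                                       Nonlinear kernel
Sample          $x^\top x'$                                                         $\langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$
Distribution    $\langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}}$  $\langle \psi(\mu_{\mathbb{P}}), \psi(\mu_{\mathbb{Q}}) \rangle_{\mathcal{H}_\kappa}$
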

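Theorem 3. Given an arbitrary probability distribution $\mathbb{P}$ with variance $\sigma^2$, a Lipschitz continuous
function $f : \mathbb{R} \to \mathbb{R}$ with constant $C_f$, an arbitrary loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ that is Lipschitz
continuous in the second argument with constant $C_\ell$, with probability at least $1 - \delta$, it follows that
$|\mathbb{E}_{x \sim \mathbb{P}}[\ell(y, f(x))] - \ell(y, \mathbb{E}_{x \sim \mathbb{P}}[f(x)])| \leq 2 C_\ell C_f \sigma / \sqrt{\delta}$ for any $y \in \mathbb{R}$.

Proof. Assume that $x$ is distributed according to $\mathbb{P}$. Then, the variances of $f(x)$ and $\ell(y, f(x))$ are
upper bounded by $C_f^2 \sigma^2$ and $C_\ell^2 C_f^2 \sigma^2$, respectively. By the triangle inequality, we have
$|\mathbb{E}_{\mathbb{P}}[\ell(y, f(x))] - \ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])| \leq |\mathbb{E}_{\mathbb{P}}[\ell(y, f(x))] - \ell(y, f(x))| + |\ell(y, f(x)) - \ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])|$.
Next, by Chebyshev's inequality, we have with probability at least $1 - \delta$ that
$|\mathbb{E}_{\mathbb{P}}[\ell(y, f(x))] - \ell(y, f(x))| \leq C_\ell C_f \sigma / \sqrt{\delta}$
and $|\ell(y, f(x)) - \ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])| \leq C_\ell |f(x) - \mathbb{E}_{\mathbb{P}}[f(x)]| \leq C_\ell C_f \sigma / \sqrt{\delta}$. Consequently, we have
$|\mathbb{E}_{\mathbb{P}}[\ell(y, f(x))] - \ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])| \leq 2 C_\ell C_f \sigma / \sqrt{\delta}$. ∎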

Theorem 3 indicates that if the random variable $x$ is concentrated around its mean and the
functions $f$ and $\ell$ are well-behaved, i.e., Lipschitz continuous, then the loss deviation
$|\mathbb{E}_{\mathbb{P}}[\ell(y, f(x))] - \ell(y, \mathbb{E}_{\mathbb{P}}[f(x)])|$ will be small. As a result, if this holds for any distribution $\mathbb{P}_i$ in the training set
$\{(\mathbb{P}_i, y_i)\}_{i=1}^{m}$, the true risk deviation $|\mathcal{R} - \mathcal{R}^{\mu}|$ is also expected to be small.
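
The loss deviation can be examined numerically. The sketch below assumes the hinge loss and $f = \tanh$, both 1-Lipschitz, so $C_\ell = C_f = 1$; these choices and the helper names are illustrative. The deviation is compared against $C_\ell C_f \hat{\sigma}$, which is smaller than the bound of Theorem 3 for every $\delta \leq 1$.

```python
import math
import random


def hinge(y, t):
    """Hinge loss; 1-Lipschitz in its second argument for y in {-1, +1}."""
    return max(0.0, 1.0 - y * t)


def loss_deviation(sample, y, f, loss):
    """|E[loss(y, f(x))] - loss(y, E[f(x)])| under the empirical distribution of `sample`."""
    n = len(sample)
    values = [f(x) for x in sample]
    mean_loss = sum(loss(y, v) for v in values) / n
    loss_of_mean = loss(y, sum(values) / n)
    return abs(mean_loss - loss_of_mean)


def std(sample):
    """Standard deviation of the empirical distribution (divides by n)."""
    n = len(sample)
    mean = sum(sample) / n
    return math.sqrt(sum((x - mean) ** 2 for x in sample) / n)


rng = random.Random(0)
results = []
for sigma in (0.1, 0.5, 2.0):
    sample = [rng.gauss(0.3, sigma) for _ in range(5000)]
    # Store (empirical sigma, loss deviation) for each spread of the input distribution.
    results.append((std(sample), loss_deviation(sample, +1, math.tanh, hinge)))
```

Each pair in `results` can be inspected to see how the deviation behaves as the input distribution becomes less concentrated.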

4.2 Flexible SVMs 

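It turns out that, for certain choices of distributions $\mathbb{P}$, the linear SMM trained using $\{(\mathbb{P}_i, y_i)\}_{i=1}^{m}$
is equivalent to an SVM trained using samples $\{(x_i, y_i)\}_{i=1}^{m}$ with an appropriate choice of kernel
function.

Lemma 4. Let $k(x, y)$ be a bounded p.d. kernel on a measure space such that $\iint k(x, y)^2\, dx\, dy < \infty$,
and $g(x, \tilde{x})$ be a square integrable function such that $\int g(x, \tilde{x})\, d\tilde{x} < \infty$ for all $x$. Given
a sample $\{(\mathbb{P}_i, y_i)\}_{i=1}^{m}$ where each $\mathbb{P}_i$ is assumed to have a density given by $g(x_i, x)$, the linear
SMM is equivalent to the SVM on the training sample $\{(x_i, y_i)\}_{i=1}^{m}$ with kernel
$K_g(x, y) = \iint k(\tilde{x}, \tilde{y})\, g(x, \tilde{x})\, g(y, \tilde{y})\, d\tilde{x}\, d\tilde{y}$.

Proof. For a training sample $\{(x_i, y_i)\}_{i=1}^{m}$, the SVM with kernel $K_g$ minimizes
$\ell(\{x_i, y_i, f(x_i) + b\}_{i=1}^{m}) + \lambda \|f\|^2_{\mathcal{H}_{K_g}}$. By the representer theorem, $f(x) = \sum_{j=1}^{m} \alpha_j K_g(x, x_j)$ with some $\alpha_j \in \mathbb{R}$,
hence this is equivalent to $\ell(\{x_i, y_i, \sum_{j=1}^{m} \alpha_j K_g(x_i, x_j) + b\}_{i=1}^{m}) + \lambda \sum_{i,j=1}^{m} \alpha_i \alpha_j K_g(x_i, x_j)$. Next,
consider the kernel mean of the probability measure $g(x_i, x)\, dx$ given by $\mu_i = \int k(\cdot, x)\, g(x_i, x)\, dx$
and note that $\langle \mu_i, f \rangle_{\mathcal{H}_k} = \int f(x)\, g(x_i, x)\, dx$ for any $f \in \mathcal{H}_k$. The linear SMM with loss $\ell$ and
kernel $k$ minimizes $\ell(\{\mathbb{P}_i, y_i, \langle \mu_i, f \rangle_{\mathcal{H}_k} + b\}_{i=1}^{m}) + \lambda \|f\|^2_{\mathcal{H}_k}$. By Theorem 1, each minimizer $f$ admits
a representation of the form $f = \sum_{j=1}^{m} \alpha_j \mu_j = \sum_{j=1}^{m} \alpha_j \int k(\cdot, x)\, g(x_j, x)\, dx$. Thus, for this $f$
we have $\langle \mu_i, f \rangle_{\mathcal{H}_k} = \sum_{j=1}^{m} \alpha_j \iint k(y, x)\, g(x_i, x)\, g(x_j, y)\, dx\, dy = \sum_{j=1}^{m} \alpha_j K_g(x_i, x_j)$ and
$\|f\|^2_{\mathcal{H}_k} = \sum_{i,j=1}^{m} \alpha_i \alpha_j \langle \mu_i, \mu_j \rangle = \sum_{i,j=1}^{m} \alpha_i \alpha_j K_g(x_i, x_j)$, as above. ∎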

Note that the important assumption for this equivalence is that the distributions $\mathbb{P}_i$ differ only
in their location in the input space. This need not be the case in all possible applications of SMMs.

Furthermore, the kernel $K_g$ can be written as
$K_g(x, y) = \left\langle \int k(\tilde{x}, \cdot)\, g(x, \tilde{x})\, d\tilde{x}, \int k(\tilde{y}, \cdot)\, g(y, \tilde{y})\, d\tilde{y} \right\rangle_{\mathcal{H}}$.
Thus, it is clear that the feature map of $x$ depends not only on the kernel $k$, but also on the density
$g(x, \tilde{x})$. Consequently, by virtue of Lemma 4, the kernel $K_g$ allows the SVM to place different
kernels at each data point. We call this algorithm a flexible SVM.

For example, if we consider the linear SMM with Gaussian distributions
$\mathcal{N}(x_1; \sigma_1^2 \cdot I), \ldots, \mathcal{N}(x_m; \sigma_m^2 \cdot I)$ and Gaussian RBF kernel $k_{\sigma^2}$ with bandwidth parameter $\sigma$, the convolution theorem of Gaussian
distributions implies that this SMM is equivalent to a flexible SVM that places a kernel $k_{\sigma^2 + 2\sigma_i^2}(x_i, \cdot)$
on training example $x_i$, i.e., a Gaussian RBF kernel with larger bandwidth.
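
The Gaussian case can be worked out explicitly. The derivation below is a sketch under the assumption that the embedding kernel is parametrized as $k_{\sigma^2}(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ on $\mathbb{R}^d$; the normalizing constants depend on this convention.

```latex
\begin{align*}
\mu_i(y) &= \int k_{\sigma^2}(x, y)\, \mathcal{N}(x; x_i, \sigma_i^2 I)\, dx
  = \Big(\tfrac{\sigma^2}{\sigma^2 + \sigma_i^2}\Big)^{d/2}
    \exp\!\Big(-\tfrac{\|y - x_i\|^2}{2(\sigma^2 + \sigma_i^2)}\Big), \\
K_g(x_i, x_j) &= \langle \mu_i, \mu_j \rangle_{\mathcal{H}}
  = \Big(\tfrac{\sigma^2}{\sigma^2 + \sigma_i^2 + \sigma_j^2}\Big)^{d/2}
    \exp\!\Big(-\tfrac{\|x_i - x_j\|^2}{2(\sigma^2 + \sigma_i^2 + \sigma_j^2)}\Big).
\end{align*}
```

Both lines follow because the difference of independent Gaussian vectors is again Gaussian with summed variances. When $\sigma_j = \sigma_i$, the squared bandwidth in $K_g$ is $\sigma^2 + 2\sigma_i^2$, which agrees with the widened kernel described above.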



5 Related Works 



The kernel $K(\mathbb{P}, \mathbb{Q}) = \langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}}$ is in fact a special case of the Hilbertian metric (Hein & Bousquet,
2005), with the associated kernel $K(\mathbb{P}, \mathbb{Q}) = \mathbb{E}_{\mathbb{P}, \mathbb{Q}}[k(x, \tilde{x})]$, and a generative mean map kernel
(GMMK) proposed by Mehta & Gray (2010). In the GMMK, the kernel between two objects $x$ and
$y$ is defined via $p_x$ and $p_y$, which are estimated probabilistic models of $x$ and $y$, respectively. That is,
a probabilistic model $p_x$ is learned for each example and used as a surrogate to construct the kernel
between those examples. The idea of surrogate kernels has also been adopted by the Probability
Product Kernel (PPK) (Jebara et al., 2004). In this case, we have $K_\rho(p, p') = \int_{\mathcal{X}} p(x)^\rho p'(x)^\rho\, dx$,
which has been shown to be a special case of GMMK when $\rho = 1$ (Mehta & Gray, 2010). Consequently,
GMMK and PPK with $\rho = 1$ are equivalent to the linear kernels between distributions
proposed in this work.

The use of expected kernels in dealing with the uncertainty in the input data has a connection
to robust SVMs. Shivaswamy et al. (2006), for instance, proposed a generalized form of the SVM
that incorporates the probabilistic uncertainty into the maximization of the margin. This results in
a second-order cone program (SOCP) that generalizes the standard SVM. In SOCP, one needs
to specify the parameter $\tau_i$ that reflects the probability of correctly classifying the $i$th training
example. The parameter $\tau_i$ is therefore closely related to the parameter $\sigma_i$, which specifies the
variance of the distribution centered at the $i$th example. Anderson & Gupta (2011) showed the
equivalence between SVMs using expected kernels and SOCP when $\tau_i = 0$. When $\tau_i > 0$, the
mean and covariance of missing kernel entries have to be estimated explicitly, making the SOCP
more involved for nonlinear kernels. Although achieving comparable performance to the standard
SVM with expected kernels, the SOCP requires a more computationally expensive SOCP solver, as
opposed to simple quadratic programming (QP).



6 Experimental Results 

In the experiments, we primarily consider three different learning algorithms: i) SVM is considered
as a baseline algorithm; ii) Augmented SVM (ASVM) is an SVM trained on augmented samples
drawn according to the distributions $\{\mathbb{P}_i\}_{i=1}^{m}$, where the same number of examples is drawn from
each distribution; iii) SMM is our distribution-based method that can be applied directly on the
distributions.¹

¹ We used the LIBSVM implementation.

Figure 1: Decision boundaries of SVM, ASVM, and SMM (panels: SVM, ASVM, SMM).



6.1 Synthetic Data 

Firstly, we conducted a basic experiment that illustrates a fundamental difference between SVM,
ASVM, and SMM. A binary classification problem of 7 Gaussian distributions with different means
and covariances was considered. We trained the SVM using only the means of the distributions,
ASVM with 30 virtual examples generated from each distribution, and SMM using distributions as
training examples. A Gaussian RBF kernel with $\gamma = 0.25$ was used for all algorithms.

Figure 1 shows the resulting decision boundaries. Having been trained only on means of the
distributions, the SVM classifier tends to overemphasize the regions with high densities and
underrepresent the lower density regions. In contrast, the ASVM is more expensive and sensitive to
outliers, especially when learning on heavy-tailed distributions. The SMM treats each distribution
as a training example and implicitly incorporates properties of the distributions, i.e., means and
covariances, into the classifier. Note that the SVM can be trained to achieve a similar result to
the SMM by choosing an appropriate value for $\gamma$ (cf. Lemma 4). Nevertheless, this becomes more
difficult if the training distributions are, for example, nonisotropic and have different covariance
matrices.

Secondly, we evaluate the performance of the SMM for different combinations of embedding
and level-2 kernels. Two classes of synthetic Gaussian distributions on $\mathbb{R}^{10}$ were generated. The
mean parameters of the positive and negative distributions are normally distributed with means
$m^{+} = (1, \ldots, 1)$ and $m^{-} = (2, \ldots, 2)$, respectively, and identical covariance matrix $\Sigma = 0.5 \cdot I_{10}$.
The covariance matrix for each distribution is generated according to two Wishart distributions
with covariance matrices given by $\Sigma^{+} = 0.6 \cdot I_{10}$ and $\Sigma^{-} = 1.2 \cdot I_{10}$ with 10 degrees of freedom.
The training set consists of 500 distributions from the positive class and 500 distributions from
the negative class. The test set consists of 200 distributions with the same class proportion as the
training set.

The kernels used in the experiment include linear kernel (LIN), polynomial kernel of degree 2
(POLY2), polynomial kernel of degree 3 (POLY3), unnormalized Gaussian RBF kernel (RBF), and
normalized Gaussian RBF kernel (NRBF). To fix parameter values of both kernel functions and
SMM, 10-fold cross-validation (10-CV) is performed on a parameter grid, $C \in \{2^{-3}, 2^{-2}, \ldots, 2^{7}\}$
for SMM, bandwidth parameter $\gamma \in \{10^{-3}, 10^{-2}, \ldots, 10^{2}\}$ for Gaussian RBF kernels, and degree
parameter $d \in \{2, 3, 4, 5, 6\}$ for polynomial kernels. The average accuracy and ±1 standard deviation
for all kernel combinations over 30 repetitions are reported in Table 3. Moreover, we also investigate
the sensitivity of kernel parameters for two kernel combinations: RBF-RBF and POLY-RBF. In this
case, we consider the bandwidth parameter $\gamma \in \{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$ for Gaussian RBF kernels
and degree parameter $d \in \{2, 3, \ldots, 8\}$ for polynomial kernels. Figure 2 depicts the accuracy values
and average accuracies for considered kernel functions.

Table 3: Accuracies (%) of SMM on synthetic data with different combinations of embedding and
level-2 kernels.
88.06±1.73
86.84±1.51
78.28±2.19
89.65±1.37
86.86±1.88

Figure 2: The heatmap plots of average accuracies of SMM over 30 experiments using POLY-RBF
(center) and RBF-RBF (right) kernel combinations with the plots of average accuracies at different
parameter values (left).

Figure 3: The performance of SVM, ASVM, and SMM algorithms on handwritten digits constructed
using three basic transformations.

Figure 4: Relative computational cost of ASVM and SMM (baseline: SMM with 2000 virtual examples).
Horizontal axis: number of virtual examples (2000, 4000, 6000).

Figure 5: Accuracies of four different techniques for natural scene categorization.

The results in Table 3 indicate that both embedding and level-2 kernels are important for the
performance of the classifier. The embedding kernels tend to have more impact on the predictive
performance compared to the level-2 kernels. This conclusion also coincides with the results depicted
in Figure 2.

6.2 Handwritten Digit Recognition 

In this section, the proposed framework is applied to distributions over equivalence classes of images
that are invariant to basic transformations, namely, scaling, translation, and rotation. We consider
the handwritten digits obtained from the USPS dataset. For each 16 × 16 image, the distribution
over the equivalence class of the transformations is determined by a prior on parameters associated
with such transformations. Scaling and translation are parametrized by the scale factors $(s_x, s_y)$ and
displacements $(t_x, t_y)$ along the $x$ and $y$ axes, respectively. The rotation is parametrized by an angle
$\theta$. We adopt Gaussian distributions as prior distributions, including $\mathcal{N}([1, 1], 0.1 \cdot I_2)$, $\mathcal{N}([0, 0], 5 \cdot I_2)$,
and $\mathcal{N}(0; \pi)$. For each image, the virtual examples are obtained by sampling parameter values from
the distribution and applying the transformation accordingly.
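
The sampling of virtual examples can be sketched as follows. For brevity the sketch transforms 2-D point coordinates rather than resampling 16 × 16 pixel grids. It assumes that the second arguments of the Gaussian priors (0.1, 5, and $\pi$) are variances and that scaling is applied before rotation and translation; neither convention is stated in the text, and all function names are illustrative.

```python
import math
import random


def sample_params(rng):
    """Draw (s_x, s_y), (t_x, t_y), theta from the Gaussian priors quoted in the text.

    Assumption: 0.1, 5, and pi are read as variances, so the standard deviations
    passed to `gauss` are their square roots.
    """
    sx = rng.gauss(1.0, math.sqrt(0.1))
    sy = rng.gauss(1.0, math.sqrt(0.1))
    tx = rng.gauss(0.0, math.sqrt(5.0))
    ty = rng.gauss(0.0, math.sqrt(5.0))
    theta = rng.gauss(0.0, math.sqrt(math.pi))
    return (sx, sy), (tx, ty), theta


def transform(point, scale, shift, theta):
    """Scale, then rotate about the origin, then translate a 2-D point (assumed order)."""
    x, y = point[0] * scale[0], point[1] * scale[1]
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1])


def virtual_examples(points, n_virtual, seed=0):
    """Generate `n_virtual` transformed copies of a point set, one parameter draw per copy."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_virtual):
        scale, shift, theta = sample_params(rng)
        copies.append([transform(p, scale, shift, theta) for p in points])
    return copies


# A toy "stroke" standing in for the pixel coordinates of a digit (hypothetical data).
stroke = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
copies = virtual_examples(stroke, n_virtual=10)
```

The ASVM would train on all generated copies as separate points, whereas the SMM treats the copies of one image as a sample from a single distribution.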

Experiments are categorized into simple and difficult binary classification tasks. The former
consists of classifying digit 1 against digit 8 and digit 3 against digit 4. The latter considers classifying
digit 3 against digit 8 and digit 6 against digit 9. The initial dataset for each task is constructed
by randomly selecting 100 examples from each class. Then, for each example in the initial dataset,
we generate 10, 20, and 30 virtual examples using the aforementioned transformations to construct
virtual datasets consisting of 2,000, 4,000, and 6,000 examples, respectively. One third of the examples
in the initial dataset are used as a test set. The original examples are excluded from the virtual
datasets. The virtual examples are normalized such that their feature values are in [0, 1]. Then, to
reduce computational cost, principal component analysis (PCA) is performed to reduce the
dimensionality to 16. We compare the SVM on the initial dataset, the ASVM on the virtual datasets,
and the SMM. For SVM and ASVM, the Gaussian RBF kernel is used. For SMM, we employ the
empirical kernel (4) with Gaussian RBF kernel as a base kernel. The parameters of the algorithms
are fixed by 10-CV over parameters $C \in \{2^{-3}, 2^{-2}, \ldots, 2^{7}\}$ and $\gamma \in \{0.01, 0.1, 1\}$.

The results depicted in Figure 3 clearly demonstrate the benefits of learning directly from the
equivalence classes of digits under basic transformations.² In most cases, the SMM outperforms
both the SVM and the ASVM as the number of virtual examples increases. Moreover, Figure 4
shows the benefit of the SMM over the ASVM in terms of computational cost.³

6.3 Natural Scene Categorization 

This section illustrates benefits of the nonlinear kernels between distributions for learning natural
scene categories in which the bag-of-words (BoW) representation is used to represent images in
the dataset. Each image is represented as a collection of local patches, each being a codeword
from a large vocabulary of codewords called a codebook. Standard BoW representations encode each
image as a histogram that enumerates the occurrence probability of local patches detected in the
image w.r.t. those in the codebook. On the other hand, our setting represents each image as a
distribution over these codewords. Thus, images of different scenes tend to generate distinct sets
of patches. Based on this representation, both the histogram and the local patches can be used in
our framework.

² While the reported results were obtained using virtual examples with Gaussian parameter distributions (Sec.
6.2), we got similar results using uniform distributions.

³ The evaluation was made on a 64-bit desktop computer with Intel® Core™ 2 Duo CPU E8400 at 3.00GHz×2
and 4GB of memory.

We use the dataset presented in Fei-Fei (2005). According to their results, most errors occur
among the four indoor categories (830 images), namely, bedroom (174 images), living room (289
images), kitchen (151 images), and office (216 images). Therefore, we will focus on these four
categories. For each category, we split the dataset randomly into two separate sets of images, 100
for training and the rest for testing.

A codebook is formed from the training images of all categories. Firstly, interesting keypoints
in the image are randomly detected. Local patches are then generated accordingly. After patch
detection, each patch is transformed into a 128-dim SIFT vector (Lowe, 1999). Given the collection
of detected patches, K-means clustering is performed over all local patches. Codewords are then
defined as the centers of the learned clusters. Then, each patch in an image is mapped to a codeword
and the image can be represented by the histogram of the codewords. In addition, we also have an
M × 128 matrix of SIFT vectors where M is the number of codewords.

We compare the performance of Probabilistic Latent Semantic Analysis (pLSA) with the
standard BoW representation, SVM, linear SMM (LSMM), and nonlinear SMM (NLSMM). For
SMM, we use the empirical embedding kernel with Gaussian RBF base kernel $k$:
$K(h_i, h_j) = \sum_{r=1}^{M} \sum_{s=1}^{M} h_i(c_r) h_j(c_s) k(c_r, c_s)$, where $h_i$ is the histogram of the $i$th image and $c_r$ is the $r$th SIFT
vector. A Gaussian RBF kernel is also used as the level-2 kernel for nonlinear SMM. For the SVM,
we adopt a Gaussian RBF kernel with $\chi^2$-distance between the histograms (Vedaldi et al., 2009),
i.e., $K(h_i, h_j) = \exp(-\gamma \chi^2(h_i, h_j))$ where
$\chi^2(h_i, h_j) = \sum_{r=1}^{M} \frac{(h_i(c_r) - h_j(c_r))^2}{h_i(c_r) + h_j(c_r)}$. The parameters of
the algorithms are fixed by 10-CV over parameters $C \in \{2^{-3}, 2^{-2}, \ldots, 2^{7}\}$ and $\gamma \in \{0.01, 0.1, 1\}$.
For NLSMM, we use the best $\gamma$ of LSMM in the base kernel and perform 10-CV to choose the $\gamma$
parameter only for the level-2 kernel. To deal with multiple categories, we adopt the pairwise
approach and voting scheme to categorize test images. The results in Figure 5 illustrate the benefit
of the distribution-based framework. Understanding the context of a complex scene is challenging.
Employing distribution-based methods provides an elegant way of utilizing higher-order statistics
in natural images that could not be captured by traditional sample-based methods.
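
The two histogram kernels used in this subsection can be sketched as follows. The sketch assumes the common form of the $\chi^2$ distance, $\sum_r (a_r - b_r)^2 / (a_r + b_r)$ with empty bins skipped, and uses toy 2-D codewords in place of 128-dimensional SIFT vectors; function names and data are illustrative.

```python
import math


def rbf(u, v, gamma=1.0):
    """Gaussian RBF base kernel between two codeword vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-0.5 * gamma * sq)


def histogram_embedding_kernel(h_i, h_j, codewords, gamma=1.0):
    """K(h_i, h_j) = sum_r sum_s h_i(c_r) h_j(c_s) k(c_r, c_s)."""
    return sum(
        h_i[r] * h_j[s] * rbf(codewords[r], codewords[s], gamma)
        for r in range(len(codewords))
        for s in range(len(codewords))
    )


def chi2_distance(h_i, h_j):
    """Chi-square distance; bins that are empty in both histograms contribute zero."""
    total = 0.0
    for a, b in zip(h_i, h_j):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return total


def chi2_rbf_kernel(h_i, h_j, gamma=1.0):
    """Gaussian RBF kernel on the chi-square distance between histograms."""
    return math.exp(-gamma * chi2_distance(h_i, h_j))


codewords = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]  # toy 2-D stand-ins for SIFT vectors
h1 = [0.5, 0.5, 0.0]
h2 = [0.0, 0.5, 0.5]
k_lin = histogram_embedding_kernel(h1, h2, codewords)
k_chi = chi2_rbf_kernel(h1, h2)
```

Unlike the $\chi^2$ kernel, the embedding kernel compares bins belonging to different codewords through $k(c_r, c_s)$, so similarity between codewords contributes to the similarity between images.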



7 Conclusions 

This paper proposes a method for kernel-based discriminative learning on probability distributions. 
The trick is to embed distributions into an RKHS, resulting in a simple and efficient learning 
algorithm on distributions. A family of linear and nonlinear kernels on distributions allows one 
to flexibly choose the kernel function that is suitable for the problems at hand. Our analyses 
provide insights into the relations between distribution-based methods and traditional sample- 
based methods, particularly the flexible SVM that allows the SVM to place different kernels on 
each training example. The experimental results illustrate the benefits of learning from a pool of 
distributions, compared to a pool of examples, both on synthetic and real-world data. 